Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

Author

Juan-Manuel Torres-Moreno

Abstract

In Automatic Text Summarization, preprocessing is an important phase to reduce the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even with normalization applied to large texts, the curse of dimensionality can degrade the performance of summarizers. This paper describes a new method for normalization of words to further reduce the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of summaries produced by this representation, but can often dramatically improve the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used.
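The idea of reducing each word to its initial letters can be illustrated with a minimal sketch. The function names `ultra_stem` and `vocabulary`, the prefix length `n`, and the regex tokenizer are assumptions made for illustration; the abstract does not state how many letters are kept or how the text is tokenized.

```python
import re
from collections import Counter
from typing import Optional


def ultra_stem(word: str, n: int = 1) -> str:
    """Truncate a word to its first n letters, lowercased.

    n is a hypothetical parameter; the abstract only says "initial letters".
    """
    return word.lower()[:n]


def vocabulary(text: str, n: Optional[int] = None) -> Counter:
    """Count terms in text, optionally after ultra-stemming to n letters."""
    # Assumed tokenizer: maximal runs of Unicode letters.
    tokens = re.findall(r"[^\W\d_]+", text)
    if n is None:
        return Counter(t.lower() for t in tokens)
    return Counter(ultra_stem(t, n) for t in tokens)


text = "Summarizers summarize summaries; stemming stems stemmed words."
full = vocabulary(text)        # one entry per distinct surface form
ultra = vocabulary(text, n=4)  # entries merged by 4-letter prefix

print(len(full), len(ultra))
print(ultra.most_common())
```

On this toy sentence the seven distinct word forms collapse into three prefixes, which is the kind of reduction in the representation space that the abstract describes. Whether such aggressive truncation helps a given summarizer is an empirical question that this sketch does not address.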

Similar resources

Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization

Preprocessing is a preliminary step in many fields including IR and NLP. The effect of basic preprocessing settings on English for text summarization is well-studied. However, to the best of our knowledge, no such effort exists for the Urdu language. In this study, we analyze the effect of basic preprocessing settings for single-document text summarization for Urdu, on a benchmark co...

A Feature Terms based Method for Improving Text Summarization with Supervised POS Tagging

Text summarization is the process of distilling the most important information from a source to produce an abridged version for a particular user and task. When this is done by means of a computer, i.e. automatically, it is called Automatic Text Summarization. Summarization can be classified into two approaches: extraction and abstraction. Extraction based summaries are produced by concatenating...

Automatic Semantic Subject Indexing of Web Documents in Highly Inflected Languages

Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A successful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of the domain of discourse of the documents and controlled vocabularies. However, analyzing text written in a highly inflec...

Improved speech summarization with multiple-hypothesis representations and kullback-leibler divergence measures

Imperfect speech recognition often leads to degraded performance when leveraging existing text-based methods for speech summarization. To alleviate this problem, this paper investigates various ways to robustly represent the recognition hypotheses of spoken documents beyond the top scoring ones. Moreover, a new summarization method stemming from the Kullback-Leibler (KL) divergence measure and ...

Journal: CoRR

Volume: abs/1209.3126

Publication date: 2012
